Challenges in clinical natural language processing for automated disorder normalization

Authors

  • Robert Leaman
  • Ritu Khare
  • Zhiyong Lu
Abstract

BACKGROUND Identifying key variables such as disorders within the clinical narratives in electronic health records has wide-ranging applications within clinical practice and biomedical research. Previous research has demonstrated lower performance of disorder named entity recognition (NER) and normalization (or grounding) in clinical narratives than in biomedical publications. In this work, we aim to identify the cause of this performance difference and introduce general solutions.

METHODS We use closure properties to compare the richness of the vocabulary in clinical narrative text to that of biomedical publications. We approach both disorder NER and normalization using machine learning methodologies. Our NER methodology is based on linear-chain conditional random fields with a rich feature approach, and we introduce several improvements to enhance the lexical knowledge of the NER system. Our normalization method, never previously applied to clinical data, uses pairwise learning to rank to automatically learn term variation directly from the training data.

RESULTS We find that while the size of the overall vocabulary is similar between clinical narratives and biomedical publications, clinical narratives use a richer terminology to describe disorders than publications. We apply our system, DNorm-C, to locate disorder mentions in the clinical narratives from the recent ShARe/CLEF eHealth Task. For NER (strict span-only), our system achieves precision=0.797, recall=0.713, f-score=0.753. For the normalization task (strict span+concept) it achieves precision=0.712, recall=0.637, f-score=0.672. The improvements described in this article increase the NER f-score by 0.039 and the normalization f-score by 0.036. We also describe a high-recall version of the NER, which increases the normalization recall to as high as 0.744, albeit with reduced precision.

DISCUSSION We perform an error analysis, demonstrating that NER errors outnumber normalization errors by more than 4-to-1. Abbreviations and acronyms are found to be frequent causes of error, in addition to mentions that the annotators were not able to identify within the scope of the controlled vocabulary.

CONCLUSION Disorder mentions in text from clinical narratives use a rich vocabulary that results in high term variation, which we believe to be one of the primary causes of reduced performance in clinical narratives. We show that pairwise learning to rank offers high performance in this context, and introduce several lexical enhancements, generalizable to other clinical NER tasks, that improve the ability of the NER system to handle this variation. DNorm-C is a high-performing, open-source system for disorders in clinical text, and a promising step toward NER and normalization methods that are trainable to a wide variety of domains and entities. (DNorm-C is open source software, and is available with a trained model at the DNorm demonstration website: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/#DNorm.)
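
As a quick consistency check on the reported figures, the sketch below recomputes the f-scores from the precision and recall values given in the abstract. It assumes the f-score is the balanced F1 measure (the harmonic mean of precision and recall), which the abstract does not state explicitly; the function name `f_score` is illustrative and is not part of DNorm-C.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (F1): harmonic mean of precision and recall.

    Assumption: the abstract's "f-score" is F1; this helper is illustrative only.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall exactly as reported in the abstract.
reported = {
    "NER (strict span-only)": (0.797, 0.713),
    "Normalization (strict span+concept)": (0.712, 0.637),
}

for task, (p, r) in reported.items():
    print(f"{task}: f-score = {f_score(p, r):.3f}")
```

Under that assumption, the recomputed values round to 0.753 and 0.672, matching the f-scores reported for NER and normalization.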

Similar resources

BioinformaticsUA: Machine Learning and Rule-Based Recognition of Disorders and Clinical Attributes from Patient Notes

Natural language processing and text analysis methods offer the potential of uncovering hidden associations from large amounts of unprocessed texts. The SemEval-2015 Analysis of Clinical Text task aimed at fostering research on the application of these methods in the clinical domain. The proposed task consisted of disorder identification with normalization to SNOMED-CT concepts, and disorder at...

Role of Natural Language Processing in Information Retrieval; Challenges and Opportunities

This paper aims to analyze the role of natural language processing (NLP). The paper will discuss the role in the context of automated data retrieval, automated question answer, and text structuring. NLP techniques are gaining wider acceptance in real life applications and industrial concerns. There are various complexities involved in processing the text of natural language that could satisfy t...

A Cascaded Approach for Social Media Text Normalization of Turkish

Text normalization is an indispensable stage for natural language processing of social media data with available NLP tools. We divide the normalization problem into 7 categories, namely letter case transformation, replacement rules & lexicon lookup, proper noun detection, deasciification, vowel restoration, accent normalization and spelling correction. We propose a cascaded approach where each...

Time Out of Joint in Temporal Annotations of Texts: Challenges for Artificial Intelligence and Human Computer Interaction

Starting from the experience of the TERENCE European project, the paper shows challenges that require a combined effort of natural language processing, automated temporal reasoning and, finally, human computer interaction. The paper starts introducing the problem of producing high quality temporal annotations for texts, and argues for a combined automated temporal reasoning and natural processi...

Post-Traumatic Stress Disorder (PTSD) Ontology and Use Case

Ontologies play an increasingly important role in annotation, integration, and analysis of biomedical data. In this paper, we describe the design and development of a Post-Traumatic Stress Disorder (PTSD) Ontology and how we can use this ontology as a controlled vocabulary for supporting automatic annotation of clinical text. The automated annotation is performed using a natural language process...

Journal:
  • Journal of Biomedical Informatics

Volume 57

Publication date: 2015